@TinasheMTapera commented Oct 2, 2025

This PR creates a pytask pipeline for diurnal aggregation, which aggregates the data into night and day values based on the position of the sun. It is separate from the original snakemake + hydra pipeline, but the repo contains both.

Review Instructions

To review this PR, please first clone the repo and install the package in a clean conda environment:

conda create -n NAME python=3.12
conda activate NAME
pip install -e .

Then, symlink the data (bld is the directory pytask projects use for built data):

ln -s /n/dominici_lab/lab/data_processing/csph-era5_sandbox/bld [YOUR DIRECTORY]/bld

Then, open the docs website to read the notebooks explaining the functionality. You can do this by right-clicking _docs/index.html in VSCode and selecting "Show Preview". Alternatively, you can run the notebook code in the notes folder (the two are identical).

Notebooks to review:

  1. Pytask demo
  2. Pytask config
  3. Pytask download, pytask aggregate

Next, you can test out pytask in your terminal. Due to the large number of tasks, this can take up to 10 minutes to run.

# see the current status of all the tasks; may run long
pytask build --dry-run

# to filter tasks by a specific run, e.g. downloading 2010 data, use -k with boolean expressions
pytask build --dry-run -k "download and 2010"

# understand where the tasks come from
pytask collect

# with more detail
pytask collect --nodes

Then, you can delete a file from bld and submit a pipeline job to regenerate it. There is already an sbatch script set up to run this in parallel:

sbatch pytask.sbatch

Improvements that could be made:

  • We could reduce the number of tasks by looping over the data catalog in groups of rows instead of individual rows, though there is a tradeoff between speed and the number of tasks (a rough sketch of this follows below).
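For illustration, grouping the catalog rows into chunks could look roughly like the following Python sketch (the jobs_df columns, chunk_size, and process_rows are hypothetical stand-ins, not the pipeline's actual names):

import pandas as pd

# Hypothetical jobs table; in the real pipeline this would come from the data catalog.
jobs_df = pd.DataFrame({
    "year": [2010, 2010, 2011, 2011],
    "variable": ["t2m", "tp", "t2m", "tp"],
})

chunk_size = 2  # number of rows handled by a single task

# Split the jobs into groups of rows; each group would become one pytask task
# instead of one task per row, trading per-task granularity for fewer tasks.
chunks = [jobs_df.iloc[i:i + chunk_size] for i in range(0, len(jobs_df), chunk_size)]

def process_rows(rows: pd.DataFrame) -> None:
    # Placeholder for the real per-row download/aggregation work.
    for _, row in rows.iterrows():
        print(f"processing year={row.year} variable={row.variable}")

for chunk in chunks:
    process_rows(chunk)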

Closes #16 #3 #17 #19 #20

…se resolves Add CC-BY License to the ERA5 dataset #20
…ublish datasets to the harvard dataverse. Also first attempt at nbdev with quarto
…tration of how to use `pytask` to manage data processing tasks in a Pythonic way, leveraging the power of decorators and type hints to define tasks and their dependencies
- Tested out pytask for building pipelines
- Used the pytask data catalog to create sets of tasks as parameters to functions using namedtuples (see the parametrization sketch after this list)
- Used the pytask data catalog to manage the parallelization of tasks
- Created a pytask logger to log the progress of tasks
- Implemented the download step of querying the ERA5 dataset in pytask
- Began implementation of the aggregation step in pytask:
    - Used the astral library to find the time of sunrise and sunset for each data point in a query
    - Assigned a diurnal class to each data point based on the time of day
    - Aggregation of data points by date and diurnal class in progress
- Adopted Quarto for documentation and notebooks, making use of [this nbdev PR](AnswerDotAI/nbdev#1521) that allows fully `.qmd`-driven packages
- Converted all `ipynb` files to `.qmd` format
- Used nbdev_docs to generate the documentation website
- Adopted a logger that solves #3
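As a rough illustration of the namedtuple-based parametrization mentioned above, tasks can be repeated over parameter sets with the @task decorator (this sketch assumes the pytask >= 0.4 API; the Job fields and output paths are invented for the example and are not the pipeline's real catalog entries):

from collections import namedtuple
from pathlib import Path
from typing import Annotated

from pytask import Product, task

# Hypothetical parameter sets; in the pipeline these come from the data catalog.
Job = namedtuple("Job", ["year", "variable"])
JOBS = [Job(2010, "t2m"), Job(2010, "tp"), Job(2011, "t2m")]

for job in JOBS:

    @task(id=f"{job.variable}-{job.year}")
    def task_download(
        job: Job = job,
        path: Annotated[Path, Product] = Path(f"bld/{job.variable}_{job.year}.nc"),
    ) -> None:
        # Placeholder for the real CDS query; writes a stub file so pytask sees a product.
        path.write_text(f"{job.variable} for {job.year}")

Each loop iteration registers a separate task with its own id, so pytask build -k can select tasks individually (e.g. -k "2010").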
This commit includes significant updates to the ERA5 data processing pipeline, focusing on using
and demonstrating `pytask` as our workflow management tool. Key changes include:

- Deleted obsolete log files for various datasets from 2015, 2017, 2019, 2021, and 2024.
- Removed unnecessary Hydra configuration files and logs from the 2025-03-17 run.
- Updated SLURM batch script to reduce maximum runtime from 18 hours to 6 hours.
- Added the pytask `config.py`, introducing a demo data catalog and adjusting the data catalog structure.
- Introduced the query object in `task_download.py` to handle data queries more effectively.
- Added `task_aggregate.py` with a modified function to convert netCDF to GeoTIFF.
- Refactored `task_download.py` to improve query handling and logging.
- Cleaned up imports and improved code organization across multiple modules.
- Updated documentation comments to reflect recent changes and maintain clarity.
- Added nbdev Quarto website documentation files.
…each xarray classified and resampled, but we need to convert to raster and then aggregate by polygon... not clear how to do this yet
… use DataFrame for diurnal classification.

WIP: Continue trying to figure out how to rasterize xarray data so that they work with the polygon_to_raster_cells function.
First, find the classifications of each point using sun position, then create two copies of the dataset with NaNs in the masked values, then resample by day. Importantly, you must set the time zone to the local time zone for the resampling by day to work correctly.
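A minimal sketch of that classify-mask-resample idea for a single grid cell, using astral for sunrise and sunset (the coordinates, column names, and time zone here are illustrative; the actual pipeline operates on xarray datasets rather than a one-cell DataFrame):

import pandas as pd
from astral import Observer
from astral.sun import sun

# Hypothetical hourly values for one grid cell; the real data come from the ERA5 netCDF.
times = pd.date_range("2010-07-01", periods=48, freq="h", tz="UTC")
df = pd.DataFrame({"t2m": range(48)}, index=times)
lat, lon = 42.36, -71.06  # example grid cell

def diurnal_class(ts):
    # Sunrise and sunset for this timestamp's date at the cell's location.
    s = sun(Observer(latitude=lat, longitude=lon), date=ts.date(), tzinfo=ts.tz)
    return "day" if s["sunrise"] <= ts <= s["sunset"] else "night"

df["diurnal"] = [diurnal_class(ts) for ts in df.index]

# Two copies with NaNs in the masked values, then resample by local calendar day.
local = df.tz_convert("America/New_York")  # local time zone so "day" boundaries line up
day_only = local["t2m"].where(local["diurnal"] == "day")
night_only = local["t2m"].where(local["diurnal"] == "night")
daily = pd.DataFrame({
    "day_mean": day_only.resample("D").mean(),
    "night_mean": night_only.resample("D").mean(),
})
print(daily)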
Parameterization now works well: pandas is used to create a dataframe of all combinations of parameters, the combinations that don't apply are filtered out, and the result is a single jobs dataframe that can be iterated over in the task function (illustrated below).
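For illustration, building such a jobs dataframe could look roughly like this (the parameter names and the filtering rule are made up for the example):

import itertools
import pandas as pd

# Hypothetical parameter grid; the real pipeline derives this from its configuration.
years = [2010, 2011]
variables = ["t2m", "tp"]
months = ["01", "02"]

jobs = pd.DataFrame(
    list(itertools.product(years, variables, months)),
    columns=["year", "variable", "month"],
)

# Drop combinations that don't apply (illustrative rule only).
jobs = jobs[~((jobs.variable == "tp") & (jobs.year == 2010))].reset_index(drop=True)

# Each row is then one unit of work iterated over in the task function.
for row in jobs.itertuples(index=False):
    print(row)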
…kes a row from the jobs dataframe as input, which makes it easier to manage parameters.

- The algorithm splits the data into day and night based on local time, which is determined from the longitude of the grid cell (see the sketch after this list).

- Remaining steps: change the query to use the new jobs dataframe, and update the notebook to reflect these changes; run and test the entire workflow to ensure everything works as expected; merge the aggregations into a single file per calendar month.
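A minimal illustration of the longitude-to-local-time idea (roughly 15 degrees of longitude per hour of offset from UTC); this is a solar-time approximation for the sketch, not necessarily the pipeline's exact implementation:

import datetime as dt

def approx_local_time(utc_time, longitude):
    # 360 degrees / 24 hours = 15 degrees of longitude per hour.
    offset_hours = longitude / 15.0
    return utc_time + dt.timedelta(hours=offset_hours)

# Example: 12:00 UTC at 75 degrees West is roughly 07:00 local solar time.
print(approx_local_time(dt.datetime(2010, 7, 1, 12, 0), longitude=-75.0))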
- Separated the qmd and ipynb files for notes and processing to test pipeline integrity
- Refactored `config.py` to enhance the data catalog structure and improve query handling; the data catalog now uses dataframes to manage jobs.
- Updated `download.py` to improve the download process and added checks for existing files (sketched after this list).
- Improved `pytask_logger.py` for better logging setup.
- Enhanced `task_aggregate.py` to optimize aggregation tasks and ensure proper output handling.
- Updated `task_data_preparation.py` to improve task definitions and exports.
- Refined `task_download.py` to include checks for existing downloads and improve logging.
- Updated Jupyter Notebook metadata to enable execution of all cells.
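As a hedged illustration of the existing-file check (download_if_missing and its arguments are hypothetical names, not the functions actually defined in download.py or task_download.py):

from pathlib import Path

def download_if_missing(target: Path, query: dict) -> Path:
    # Skip the expensive CDS request when the output already exists.
    if target.exists():
        print(f"Skipping {target}: already downloaded")
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    # Placeholder for the actual CDS API call that writes `target`.
    target.write_text(str(query))
    return target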
- Added a new core module for internal functions and testing, including utilities for path expansion, dynamic function importing, and directory structure creation.
- Implemented a Google Drive authentication class for fetching healthshed files.
- Created a ClimateDataFileHandler class to manage different file types from the Climate Data Store (CDS).
- Added a testAPI function to validate API connections and configurations.
- Updated aggregation module to use a specific example file for testing.
- Refactored various notebooks to improve clarity and execution flow.
- Removed unnecessary execution flags from multiple notebooks.
- Enhanced the task_aggregate.py script to include raster calculations and aggregation to healthsheds.